
Introduce async scheduler implementation with mixin pattern#941

Draft
GOavi101 wants to merge 1 commit into torch-spyre:main from GOavi101:feature/async-scheduler-mixin-pattern

Conversation

Collaborator

@GOavi101 GOavi101 commented Apr 21, 2026

Description

Introduce async scheduler implementation with mixin pattern for cleaner architecture.

New Implementation (mixins)

  • PoolingSpyreMixin and ChunkedPrefillSpyreMixin classes
  • Runtime detection via _is_async_scheduler() (isinstance check)
  • Simple multiple inheritance for concrete classes:
    • class PoolingSpyreScheduler(PoolingSpyreMixin, Scheduler):
    • class AsyncPoolingSpyreScheduler(PoolingSpyreMixin, AsyncScheduler):
    • class ChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, Scheduler):
    • class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, AsyncScheduler):
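The runtime dispatch these classes rely on can be sketched as follows; the `Scheduler` and `AsyncScheduler` bases here are minimal stand-ins for vLLM's real classes so the example runs in isolation:

```python
# Sketch of the mixin pattern described in this PR. Scheduler and
# AsyncScheduler are stand-ins for vLLM's classes, and the schedule()
# bodies are illustrative only.

class Scheduler:                      # stand-in for vLLM's sync scheduler
    def schedule(self):
        return "sync-schedule"

class AsyncScheduler(Scheduler):      # stand-in for vLLM's async scheduler
    def schedule(self):
        return "async-schedule"

class PoolingSpyreMixin:
    def _is_async_scheduler(self) -> bool:
        # Runtime detection: check which concrete base the mixin was
        # combined with, instead of capturing is_async in a closure.
        return isinstance(self, AsyncScheduler)

    def schedule(self):
        mode = "async" if self._is_async_scheduler() else "sync"
        # ... warmup-shape constraints would be applied here ...
        return f"{mode}: {super().schedule()}"

# Concrete classes are plain multiple inheritance; the MRO puts the
# mixin ahead of the base scheduler, so super() reaches the right base.
class PoolingSpyreScheduler(PoolingSpyreMixin, Scheduler):
    pass

class AsyncPoolingSpyreScheduler(PoolingSpyreMixin, AsyncScheduler):
    pass

print(PoolingSpyreScheduler().schedule())       # sync: sync-schedule
print(AsyncPoolingSpyreScheduler().schedule())  # async: async-schedule
```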

Related Issues

Test Plan

  • Added comprehensive unit tests in tests/v1/core/test_async_scheduler.py (16 tests):
    • TestIsAsyncScheduler: Verifies _is_async_scheduler() detection (4 tests)
    • TestPoolingSpyreMixinSchedule: Tests warmup-shape constraints in sync/async modes (4 tests)
    • TestChunkedPrefillSpyreMixinSchedule: Verifies constraint bypass in async mode (3 tests)
    • TestChunkedPrefillSpyreMixinUpdateFromOutput: Tests scheduler output filtering in async mode (5 tests)
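A hypothetical sketch of what the detection tests above check, again with stand-in bases in place of vLLM's `Scheduler`/`AsyncScheduler` (the real tests live in `tests/v1/core/test_async_scheduler.py`):

```python
# Illustrative, not the actual test file: verifies that
# _is_async_scheduler() distinguishes sync from async concrete classes.

class Scheduler: ...
class AsyncScheduler(Scheduler): ...

class SpyreMixin:
    def _is_async_scheduler(self) -> bool:
        return isinstance(self, AsyncScheduler)

class SyncImpl(SpyreMixin, Scheduler): ...
class AsyncImpl(SpyreMixin, AsyncScheduler): ...

def test_is_async_scheduler():
    # The sync concrete class is not an AsyncScheduler instance...
    assert not SyncImpl()._is_async_scheduler()
    # ...while the async concrete class is.
    assert AsyncImpl()._is_async_scheduler()

test_is_async_scheduler()
```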

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)

@GOavi101 GOavi101 requested review from dilipgb and joerunde April 21, 2026 08:10
@github-actions

👋 Hi! Thank you for contributing.
Just a reminder: make sure your code passes all the linting checks, otherwise your PR can't be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

@GOavi101 GOavi101 force-pushed the feature/async-scheduler-mixin-pattern branch 15 times, most recently from 1a3ecbb to b0e8e83 Compare April 22, 2026 17:20
Comment thread vllm_spyre/v1/core/scheduler.py Outdated
SchedulerOutput = None

logger = init_logger(__name__)
from vllm_spyre.v1.core.scheduler_impl import (
Collaborator

@joerunde joerunde Apr 22, 2026


@GOavi101 it looks like most of this file has been deleted and moved to scheduler_impl. Can you put the implementation back in this file so that reviewers can see what's changed?

Collaborator Author


Sure, Joe.

Collaborator


Thanks, I've looked through the tests, but I'll wait to review the code changes until this diff is in nicer shape; I don't really want to try to recreate the diff myself 😉

Collaborator Author


Done!!!

@GOavi101 GOavi101 force-pushed the feature/async-scheduler-mixin-pattern branch from b0e8e83 to d71cfb3 Compare April 22, 2026 17:34
return EMPTY_MODEL_RUNNER_OUTPUT
cached = self._last_execute_model_output
self._last_execute_model_output = None
return cached if cached is not None else EMPTY_MODEL_RUNNER_OUTPUT
Collaborator


Ideally we would actually run the sampling here - see related comment on the structured output PR: #903 (comment)

I'm fine with leaving this as-is and then fixing it to work with both async scheduling and structured outputs in a followup. Issue opened here: #947

Collaborator Author


Agreed, thanks for opening the issue. I've added a TODO(#947) comment pointing to it so it's tracked directly in the code.

Comment thread tests/v1/core/test_async_scheduler.py Outdated
Key behaviours under test:
- _is_async_scheduler() correctly identifies async vs sync instances
- PoolingSpyreMixin.schedule() applies warmup-shape constraints in both modes
- ChunkedPrefillSpyreMixin.schedule() bypasses Spyre constraints in async mode
Collaborator


This statement seems incorrect- we definitely can't just bypass spyre constraints because there are hard limits to what we can run on the cards. What's really going on?

Collaborator Author

@GOavi101 GOavi101 Apr 23, 2026


Correct, nothing is bypassed; that docstring was wrong. The same constraints apply in both modes. The only async-specific code is a stale ongoing_prefills cleanup, needed because _update_after_schedule speculatively advances num_computed_tokens before update_from_output() confirms it. Fixed the docstring.
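The cleanup described in this comment might look something like the sketch below; `Request` and the `ongoing_prefills` list are simplified stand-ins for the real vLLM/vllm-spyre structures, and the helper name is hypothetical:

```python
# Hedged sketch of the stale-prefill cleanup: under async scheduling,
# num_computed_tokens is advanced speculatively right after schedule(),
# so a request can look "done prefilling" before update_from_output()
# confirms it, and must be pruned from the ongoing-prefill tracking.

from dataclasses import dataclass

@dataclass
class Request:                         # simplified stand-in
    req_id: str
    num_computed_tokens: int
    num_prompt_tokens: int

    @property
    def prefill_done(self) -> bool:
        return self.num_computed_tokens >= self.num_prompt_tokens

def prune_finished_prefills(ongoing_prefills: list[Request]) -> list[Request]:
    # Drop requests whose prefill has (speculatively) completed so the
    # next schedule() call does not treat them as still prefilling.
    return [r for r in ongoing_prefills if not r.prefill_done]

ongoing = [Request("a", 64, 64), Request("b", 32, 64)]
ongoing = prune_finished_prefills(ongoing)
print([r.req_id for r in ongoing])  # ['b']
```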

Comment thread sendnn_inference/platform.py Outdated
is_pooling=True,
)
# Set as string path for vLLM's resolution (matches upstream behavior)
# Only convert to string if it's not already a string
Collaborator


a class should be fine to pass here though, what goes wrong?

Collaborator Author

@GOavi101 GOavi101 Apr 23, 2026


Nothing goes wrong; you're right. SchedulerConfig.scheduler_cls is typed str | type | None, and get_scheduler_cls() handles a class directly, so the string conversion was unnecessary. Removed it in the latest push.
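The resolution behavior described here can be illustrated with a toy resolver; this is a sketch of the pattern, not vLLM's actual implementation:

```python
# Illustrative resolver accepting either a class object or a dotted-path
# string, mirroring the str | type handling described above. The
# function name echoes get_scheduler_cls() but this body is a sketch.

import importlib

def get_scheduler_cls(scheduler_cls):
    if isinstance(scheduler_cls, type):
        return scheduler_cls          # already a class: no conversion needed
    module_name, _, cls_name = scheduler_cls.rpartition(".")
    return getattr(importlib.import_module(module_name), cls_name)

class MyScheduler:
    pass

# Both forms resolve correctly, so stringifying the class is unnecessary.
assert get_scheduler_cls(MyScheduler) is MyScheduler
print(get_scheduler_cls("collections.OrderedDict").__name__)  # OrderedDict
```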

Comment thread sendnn_inference/platform.py Outdated
# The mixin's pre-filter pattern is not safe under that run-ahead scenario.
# For TP=1 (UniProcExecutor), futures are immediately done so it's safe.
if parallel_config.world_size > 1:
scheduler_config.async_scheduling = False
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting: if we wanted to support this feature, it would likely need to work with TP=4, which is how we run most models. I thought this was only incompatible with pipeline parallel upstream; does it also not work with tensor parallel?

Collaborator Author

@GOavi101 GOavi101 Apr 23, 2026


@joerunde

The fix is SpyreMultiprocExecutor — a thin MultiprocExecutor subclass that overrides max_concurrent_batches to return 1 instead of 2. This forces the engine to use the simpler step() path (strictly schedule → execute → update) rather than step_with_batch_queue, which was the only thing that broke TP>1.
Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose. The AsyncScheduler base class and its _update_after_schedule TTFT benefit are still fully active — we just removed the run-ahead that its state tracking couldn't handle.
So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.

what do you think?
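The override proposed in this comment would look roughly like the sketch below; the base class is a stand-in for vLLM's MultiprocExecutor, and the default of 2 is taken from the description above:

```python
# Sketch of the SpyreMultiprocExecutor proposal: return 1 from
# max_concurrent_batches so the engine uses the plain step() path
# (schedule -> execute -> update) instead of step_with_batch_queue.
# (Note: a later comment in this thread walks this approach back.)

class MultiprocExecutor:              # stand-in for vLLM's executor
    @property
    def max_concurrent_batches(self) -> int:
        return 2  # default that enables the run-ahead batch queue

class SpyreMultiprocExecutor(MultiprocExecutor):
    @property
    def max_concurrent_batches(self) -> int:
        return 1  # force the simpler step() path

print(SpyreMultiprocExecutor().max_concurrent_batches)  # 1
```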

Collaborator


That doesn't quite line up with my understanding. IIUC, the step_with_batch_queue method is what works with the speculative scheduling: the engine runs the scheduler again while the model is running, assuming that the requests in the batch will continue.

Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose

I don't quite understand this either: the multiproc executor is definitely async. It broadcasts an RPC to the workers to run the model, and the engine gets back a future that it waits on. step_with_batch_queue queues up that future so that it can speculatively schedule the next pass.

This TP=1 profile shows the scheduler running in between the model forward passes; the goal with async scheduling is to get the scheduler running for the next step during the model forward pass instead:

[profile screenshot]

The AsyncScheduler base class and its _update_after_schedule TTFT benefit are still fully active — we just removed the run-ahead that its state tracking couldn't handle.
So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.

Based on the above, my understanding is that the run-ahead state is the whole point and we won't gain any performance benefit from this unless we support it, so this is a blocker. Is there something else I'm missing?

Collaborator Author


You're right, thanks for the correction. I'll fix this: snapshot the mixin's mutable state (ongoing_prefills, tkv, previous_step_was_prefill) before delegating to super().schedule(), so the run-ahead second schedule() call sees consistent state, and remove SpyreMultiprocExecutor. That way TP≥2 gets the full async scheduling benefit.
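A hedged sketch of the snapshot idea described in this comment; the base class is a stand-in, and the restore-on-failure detail is an assumption for illustration rather than the PR's actual implementation:

```python
# Sketch: copy the mixin's mutable fields before delegating to the base
# scheduler, so a speculative second schedule() call can start from
# consistent state. Field names follow the comment above; the
# snapshot/restore helpers are hypothetical.

import copy

class SchedulerStub:                  # stand-in for the async base class
    def schedule(self):
        return "scheduled"

class ChunkedPrefillSpyreMixin:
    def __init__(self):
        super().__init__()
        self.ongoing_prefills: list = []
        self.tkv = 0
        self.previous_step_was_prefill = False

    def _snapshot_state(self):
        return (copy.copy(self.ongoing_prefills), self.tkv,
                self.previous_step_was_prefill)

    def _restore_state(self, snap):
        self.ongoing_prefills, self.tkv, self.previous_step_was_prefill = snap

    def schedule(self):
        snap = self._snapshot_state()
        try:
            return super().schedule()
        except Exception:
            self._restore_state(snap)  # keep state consistent on failure
            raise

class AsyncSpyreScheduler(ChunkedPrefillSpyreMixin, SchedulerStub):
    pass

print(AsyncSpyreScheduler().schedule())  # scheduled
```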

@joerunde
Collaborator

Thanks @GOavi101!

A few notes:

  1. If this can't be done with tensor parallel, then maybe it's not worth pursuing. Is that a hard blocker?
  2. We need to have an end-to-end test that shows this working, ie using an LLM with async scheduling enabled. It would also be good to include an illustrative test at the engine level (see https://github.com/torch-spyre/sendnn-inference/blob/main/tests/e2e/test_spyre_pc_scheduler_steps.py) that shows the effects of async scheduling. From my quick skim it sounds like the engine is speculatively scheduling batches one step ahead, so we should see a "dead token" in some cases where the engine schedules a decode past the end of a sequence.
  3. It would be really great to see a profile of this in action, or at least some minimal vllm bench results showing what kind of performance improvement we can expect.

@GOavi101 GOavi101 force-pushed the feature/async-scheduler-mixin-pattern branch 5 times, most recently from 1bd875b to 2246d48 Compare April 23, 2026 10:03
@GOavi101
Copy link
Copy Markdown
Collaborator Author

GOavi101 commented Apr 23, 2026

After vLLM 0.14, the async scheduler is enabled by default, so all the tests below run with the async scheduler. To run with the synchronous scheduler, add the --no-async-scheduling flag.

@GOavi101 GOavi101 force-pushed the feature/async-scheduler-mixin-pattern branch 3 times, most recently from 7b0a718 to fb7ee62 Compare April 23, 2026 20:09
Replace _create_pooling_scheduler() and _create_chunked_prefill_scheduler()
factory functions with PoolingSpyreMixin and ChunkedPrefillSpyreMixin classes.

Each mixin uses _is_async_scheduler() (isinstance check) to detect the concrete
base class at runtime and adjust behaviour accordingly, instead of capturing
is_async via a closure variable.

Concrete classes use simple multiple inheritance:

  class PoolingSpyreScheduler(PoolingSpyreMixin, Scheduler): pass
  class AsyncPoolingSpyreScheduler(PoolingSpyreMixin, AsyncScheduler): pass
  class ChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, Scheduler): pass
  class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, AsyncScheduler): pass

Side effects:
- __module__/__name__/__qualname__ fixup blocks removed (no longer needed)
- _async_warning_logged flag removed (debug log emitted each call is fine)
- TYPE_CHECKING import removed (unused after refactor)

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
@GOavi101 GOavi101 force-pushed the feature/async-scheduler-mixin-pattern branch from fb7ee62 to c5db31a Compare April 24, 2026 05:21